
    Using natural language processing to improve biomedical concept normalization and relation mining

    This thesis concerns the use of natural language processing for improving biomedical concept normalization and relation mining. We begin by introducing the background of biomedical text mining, and subsequently describe a typical text mining pipeline, some key issues and problems in mining biomedical texts, and the possibility of using natural language processing to solve these problems. Finally, we end with an outline of the work done in this thesis.

    Training text chunkers on a silver standard corpus: Can silver replace gold?

    Background: To train chunkers in recognizing noun phrases and verb phrases in biomedical text, an annotated corpus is required. The creation of gold standard corpora (GSCs), however, is expensive and time-consuming. GSCs therefore tend to be small and to focus on specific subdomains, which limits their usefulness. We investigated the use of a silver standard corpus (SSC) that is automatically generated by combining the outputs of multiple chunking systems. We explored two use scenarios: one in which chunkers are trained on an SSC in a new domain for which a GSC is not available, and one in which chunkers are trained on an available but small GSC supplemented with an SSC. Results: We have tested the two scenarios using three chunkers, Lingpipe, OpenNLP, and Yamcha, and two different corpora, GENIA and PennBioIE. For the first scenario, we showed that the systems trained for noun-phrase recognition on the SSC in one domain performed 2.7-3.1 percenta
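
    A silver standard of this kind can be approximated by majority voting over the token-level chunk labels produced by the individual systems. The sketch below is illustrative only; the tag set, the tie-breaking rule, and the example chunker outputs are assumptions, not the combination procedure used in the paper.

        from collections import Counter

        # Hypothetical BIO-tagged outputs of three chunkers for the same sentence
        # (tags and agreement pattern are invented for illustration).
        chunker_outputs = [
            ["B-NP", "I-NP", "O", "B-VP", "B-NP"],   # Lingpipe-style output
            ["B-NP", "I-NP", "O", "B-VP", "I-NP"],   # OpenNLP-style output
            ["B-NP", "O",    "O", "B-VP", "B-NP"],   # Yamcha-style output
        ]

        def silver_tags(outputs, min_votes=2):
            """Combine per-token BIO tags by majority vote.

            Tokens on which fewer than `min_votes` systems agree are left
            unlabeled ('O'), so the silver corpus keeps only consensus chunks.
            """
            silver = []
            for token_tags in zip(*outputs):
                tag, votes = Counter(token_tags).most_common(1)[0]
                silver.append(tag if votes >= min_votes else "O")
            return silver

        print(silver_tags(chunker_outputs))
        # ['B-NP', 'I-NP', 'O', 'B-VP', 'B-NP']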

    ContextD: An algorithm to identify contextual properties of medical terms in a Dutch clinical corpus

    Background: In order to extract meaningful information from electronic medical records, such as signs and symptoms, diagnoses, and treatments, it is important to take into account the contextual properties of the identified information: negation, temporality, and experiencer. Most work on automatic identification of these contextual properties has been done on English clinical text. This study presents ContextD, an adaptation of the English ConText algorithm to the Dutch language, and a Dutch clinical corpus. Results: The ContextD algorithm utilized 41 unique triggers to identify the contextual properties in the clinical corpus. For the negation property, the algorithm obtained an F-score from 87% to 93% for the different document types. For the experiencer property, the F-score was 99% to 100%. For the historical and hypothetical values of the temporality property, F-scores ranged from 26% to 54% and from 13% to 44%, respectively. Conclusions: ContextD showed good performance in identifying negation and experiencer property values across all Dutch clinical document types. Accurate identification of the temporality property proved to be difficult and requires further work. The anonymized and annotated Dutch clinical corpus can serve as a useful resource for further algorithm development.
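
    The trigger-based idea behind ConText and ContextD can be pictured as scanning a sentence for trigger terms and marking concept mentions that fall within a trigger's scope. A minimal sketch follows; the Dutch trigger words, the fixed scope window, and the function names are assumptions for illustration, not the actual ContextD rules or its 41 triggers.

        # Illustrative ConText-style negation check (not the actual ContextD rules).
        NEGATION_TRIGGERS = ["geen", "niet", "zonder"]  # hypothetical trigger list
        SCOPE = 5  # tokens after a trigger assumed to fall in its negation scope

        def is_negated(tokens, concept_index):
            """Return True if the token at `concept_index` lies within the
            forward scope of a preceding negation trigger."""
            for i, token in enumerate(tokens):
                if token.lower() in NEGATION_TRIGGERS:
                    if i < concept_index <= i + SCOPE:
                        return True
            return False

        tokens = "patient heeft geen koorts".split()
        print(is_negated(tokens, tokens.index("koorts")))  # True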

    Knowledge-based extraction of adverse drug events from biomedical text

    Background: Many biomedical relation extraction systems are machine-learning based and have to be trained on large annotated corpora that are expensive and cumbersome to construct. We developed a knowledge-based relation extraction system that requires minimal training data, and applied the system for the extraction of adverse drug events from biomedical text. The system consists of a concept recognition module that identifies drugs and adverse effects in sentences, and a knowledg
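
    One way to realize such a knowledge-based module is to keep a drug and adverse-effect pair found in a sentence only if the knowledge source records a link between the two concepts. The sketch below is a heavily simplified illustration under assumptions (the abstract is truncated here, and the knowledge-base contents, concept names, and function names are invented), not the system described in the paper.

        # Hypothetical knowledge base of known drug -> adverse-event links
        # (a real system would use UMLS-style concept identifiers).
        KNOWN_ADE_LINKS = {
            ("amoxicillin", "rash"),
            ("ibuprofen", "gastric ulcer"),
        }

        def extract_ade_pairs(recognized_drugs, recognized_effects):
            """Return candidate (drug, adverse effect) pairs from one sentence
            that are supported by the knowledge base."""
            pairs = []
            for drug in recognized_drugs:
                for effect in recognized_effects:
                    if (drug, effect) in KNOWN_ADE_LINKS:
                        pairs.append((drug, effect))
            return pairs

        # Concepts found by the recognition module in one sentence (illustrative).
        print(extract_ade_pairs(["amoxicillin"], ["rash", "headache"]))
        # [('amoxicillin', 'rash')]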

    Using rule-based natural language processing to improve disease normalization in biomedical text

    Background and objective: In order for computers to extract useful information from unstructured text, a concept normalization system is needed to link relevant concepts in a text to sources that contain further information about the concept. Popular concept normalization tools in the biomedical field are dictionary-based. In this study we investigate the usefulness of natural language processing (NLP) as an adjunct to dictionary-based concept normalization. Methods: We compared the performance of two biomedical concept normalization systems, MetaMap and Peregrine, on the Arizona Disease Corpus, with and without the use of a rule-based NLP module. Performance was assessed for exact and inexact boundary matching of the system annotations with those of the gold standard and for concept identifier matching. Results: Without the NLP module, MetaMap and Peregrine attained F-scores of 61.0% and 63.9%, respectively, for exact boundary matching, and 55.1% and 56.9% for concept identifier matching. With the aid of the NLP module, the F-scores of MetaMap and Peregrine improved to 73.3% and 78.0% for boundary matching, and to 66.2% and 69.8% for concept identifier matching. For inexact boundary matching, performances further increased to 85.5% and 85.4%, and to 73.6% and 73.3% for concept identifier matching. Conclusions: We have shown the added value of NLP for the recognition and normalization of diseases with MetaMap and Peregrine. The NLP module is general and can be applied in combination with any concept normalization system. Whether its use for concept types other than disease is equally advantageous remains to be investigated.
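
    The exact-boundary evaluation used here reduces to counting span-level true positives, false positives, and false negatives against the gold standard. A minimal sketch with made-up character offsets; the corpus format and the inexact and concept-identifier matching variants are not shown.

        def prf_exact(gold_spans, system_spans):
            """Precision, recall, and F-score for exact boundary matching.

            Spans are (start, end) character offsets; inexact matching would
            instead count any overlapping gold/system pair as a hit.
            """
            gold, system = set(gold_spans), set(system_spans)
            tp = len(gold & system)
            precision = tp / len(system) if system else 0.0
            recall = tp / len(gold) if gold else 0.0
            f = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
            return precision, recall, f

        # Illustrative gold and system annotations (offsets are invented).
        gold = [(0, 8), (15, 27), (40, 52)]
        system = [(0, 8), (16, 27), (40, 52)]
        print(prf_exact(gold, system))  # (0.667, 0.667, 0.667) approximately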

    Probing Shadowed Nuclear Sea with Massive Gauge Bosons in the Future Heavy-Ion Collisions

    The production of the massive bosons $Z^0$ and $W^{\pm}$ could provide an excellent tool to study cold nuclear matter effects and the modifications of nuclear parton distribution functions (nPDFs) relative to parton distribution functions (PDFs) of a free proton in high energy nuclear reactions at the LHC, as well as in heavy-ion collisions (HIC) at the much higher center-of-mass energies available in future colliders. In this paper we calculate the rapidity and transverse momentum distributions of the vector bosons and their nuclear modification factors in p+Pb collisions at $\sqrt{s_{NN}}=63$ TeV and in Pb+Pb collisions at $\sqrt{s_{NN}}=39$ TeV in the framework of perturbative QCD by utilizing three parametrization sets of nPDFs: EPS09, DSSZ and nCTEQ. It is found that in heavy-ion collisions at such high colliding energies, both the rapidity distribution and the transverse momentum spectrum of vector bosons are considerably suppressed in wide kinematic regions with respect to p+p reactions due to the large nuclear shadowing effect. We demonstrate that in massive vector boson production, processes with sea quarks in the initial state may give larger contributions than those with valence quarks in the initial state; therefore, in future heavy-ion collisions the isospin effect is less pronounced and the charge asymmetry of the W boson will be reduced significantly compared to that at the LHC. A large difference between results with nCTEQ and results with EPS09 and DSSZ is observed in the nuclear modifications of both the rapidity and $p_T$ distributions of $Z^0$ and $W$ in the future HIC. Comment: 13 pages, 21 figures, version accepted for publication in Eur. Phys. J.
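
    The nuclear modification factors discussed here compare the per-nucleon cross section in the nuclear collision to the p+p baseline at the same energy. Under one common minimum-bias convention (the notation below is assumed, not taken from the paper):

        R_{p\mathrm{Pb}}(y) = \frac{1}{A}\,
            \frac{\mathrm{d}\sigma_{p\mathrm{Pb}}/\mathrm{d}y}{\mathrm{d}\sigma_{pp}/\mathrm{d}y},
        \qquad
        R_{\mathrm{PbPb}}(y) = \frac{1}{A^{2}}\,
            \frac{\mathrm{d}\sigma_{\mathrm{PbPb}}/\mathrm{d}y}{\mathrm{d}\sigma_{pp}/\mathrm{d}y}

    Shadowing of the nuclear sea-quark distributions then appears as $R < 1$ in the kinematic regions dominated by small-$x$ partons.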

    The CALBC Silver Standard Corpus for Biomedical Named Entities - A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers

    The production of gold standard corpora is time-consuming and costly. We propose an alternative: the 'silver standard corpus' (SSC), a corpus that has been generated by the harmonisation of the annotations delivered by a selection of annotation systems. The systems have to share the type system for the annotations, and the harmonisation solution has to use a suitable similarity measure for the pair-wise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630,324 sentences, 15,956,841 tokens). We can demonstrate that the annotation of proteins and genes shows higher diversity across all annotation solutions used, leading to lower agreement against the harmonised set in comparison to the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that a high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generate an annotated corpus from automated annotation systems. Further research is required to understand how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.
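
    The pair-wise comparison of annotations mentioned above needs a similarity measure over annotated spans. A simple, commonly used choice is character-offset overlap with a threshold deciding whether two annotations are treated as the same entity; this is an assumed illustration, not necessarily the measure used in CALBC.

        def span_similarity(a, b):
            """Jaccard-style overlap between two annotations given as
            (start, end) character offsets on the same document."""
            overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
            union = max(a[1], b[1]) - min(a[0], b[0])
            return overlap / union if union else 0.0

        def agree(a, b, threshold=0.8):
            """Treat two annotations as equivalent if their overlap is high enough."""
            return span_similarity(a, b) >= threshold

        # Two taggers annotating roughly the same protein mention (offsets invented).
        print(span_similarity((100, 112), (100, 115)))  # 0.8
        print(agree((100, 112), (100, 115)))            # True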

    MCL-CAw: A refinement of MCL for detecting yeast complexes from weighted PPI networks by incorporating core-attachment structure

    Abstract Background The reconstruction of protein complexes from the physical interactome of organisms serves as a building block towards understanding the higher-level organization of the cell. Over the past few years, several independent high-throughput experiments have helped to catalogue an enormous amount of physical protein interaction data from organisms such as yeast. However, these individual datasets show a lack of correlation with each other and also contain a substantial number of false positives (noise). Over the years, several affinity scoring schemes have also been devised to improve the quality of these datasets. Therefore, the challenge now is to detect meaningful as well as novel complexes from protein interaction (PPI) networks derived by combining datasets from multiple sources and by making use of these affinity scoring schemes. In the attempt to tackle this challenge, the Markov Clustering algorithm (MCL) has proved to be a popular and reasonably successful method, mainly due to its scalability, robustness, and ability to work on scored (weighted) networks. However, MCL produces many noisy clusters, which either do not match known complexes or have additional proteins that reduce the accuracy of correctly predicted complexes. Results Inspired by recent experimental observations by Gavin and colleagues on the modularity structure in yeast complexes and the distinctive properties of "core" and "attachment" proteins, we develop a core-attachment based refinement method coupled to MCL for the reconstruction of yeast complexes from scored (weighted) PPI networks. We combine physical interactions from two recent "pull-down" experiments to generate an unscored PPI network. We then score this network using available affinity scoring schemes to generate multiple scored PPI networks. The evaluation of our method (called MCL-CAw) on these networks shows that: (i) MCL-CAw derives a larger number of yeast complexes, and with better accuracies, than MCL, particularly in the presence of natural noise; (ii) affinity scoring can effectively reduce the impact of noise on MCL-CAw and thereby improve the quality (precision and recall) of its predicted complexes; (iii) MCL-CAw responds well to most available scoring schemes. We discuss several instances where MCL-CAw was successful in deriving meaningful complexes, and where it missed a few proteins or whole complexes due to affinity scoring of the networks. We compare MCL-CAw with several recent complex detection algorithms on unscored and scored networks, and assess the relative performance of the algorithms on these networks. Further, we study the impact of augmenting physical datasets with computationally inferred interactions for complex detection. Finally, we analyse the essentiality of proteins within predicted complexes to understand a possible correlation between protein essentiality and their ability to form complexes. Conclusions We demonstrate that core-attachment based refinement in MCL-CAw improves the predictions of MCL on yeast PPI networks. We show that affinity scoring improves the performance of MCL-CAw.
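
    The core-attachment refinement can be pictured as a post-processing step on each MCL cluster: proteins with high weighted connectivity inside the cluster form the core, and the remaining proteins are kept as attachments only if they connect strongly enough to that core. The sketch below is a simplified illustration under assumed thresholds and scoring, not the published MCL-CAw procedure.

        # Simplified core-attachment refinement of one MCL cluster (not MCL-CAw itself).
        # `weights` maps unordered protein pairs to affinity scores; thresholds are assumptions.

        def refine_cluster(cluster, weights, core_fraction=0.5, attach_threshold=1.0):
            """Split a cluster into a densely connected core plus attachments."""
            def in_cluster_degree(p, members):
                return sum(weights.get(frozenset((p, q)), 0.0) for q in members if q != p)

            # Core: proteins whose weighted in-cluster degree reaches
            # `core_fraction` of the cluster maximum.
            degrees = {p: in_cluster_degree(p, cluster) for p in cluster}
            max_deg = max(degrees.values())
            core = {p for p, d in degrees.items() if d >= core_fraction * max_deg}

            # Attachments: non-core proteins connected to the core above a threshold.
            attachments = {p for p in cluster - core
                           if in_cluster_degree(p, core) >= attach_threshold}
            return core, attachments

        # Toy weighted PPI network with four proteins (scores invented).
        weights = {frozenset(p): w for p, w in [
            (("A", "B"), 0.9), (("A", "C"), 0.8), (("B", "C"), 0.7), (("C", "D"), 0.4)]}
        print(refine_cluster({"A", "B", "C", "D"}, weights, attach_threshold=0.3))
        # ({'A', 'B', 'C'}, {'D'})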

    Constraints on Spin-Independent Nucleus Scattering with sub-GeV Weakly Interacting Massive Particle Dark Matter from the CDEX-1B Experiment at the China Jin-Ping Laboratory

    We report results of searches for weakly interacting massive particles (WIMPs) with sub-GeV masses ($m_{\chi}$) via WIMP-nucleus spin-independent scattering with the Migdal effect incorporated. Analyses of time-integrated (TI) and annual modulation (AM) effects on CDEX-1B data are performed, with 737.1 kg$\cdot$day exposure and a 160 eVee threshold for the TI analysis, and 1107.5 kg$\cdot$day exposure and a 250 eVee threshold for the AM analysis. The sensitive windows in $m_{\chi}$ are expanded by an order of magnitude to lower DM masses with the Migdal effect incorporated. New limits on $\sigma_{\chi N}^{\rm SI}$ at 90% confidence level are derived as $2\times10^{-32} \sim 7\times10^{-35}$ cm$^2$ for the TI analysis at $m_{\chi} \sim$ 50-180 MeV/$c^2$, and $3\times10^{-32} \sim 9\times10^{-38}$ cm$^2$ for the AM analysis at $m_{\chi} \sim$ 75 MeV/$c^2$-3.0 GeV/$c^2$. Comment: 5 pages, 4 figures